
Collaborating Authors

Apollo 11




A Practical Guide for Evaluating LLMs and LLM-Reliant Systems

Rudd, Ethan M., Andrews, Christopher, Tully, Philip

arXiv.org Artificial Intelligence

Recent advances in generative AI have led to remarkable interest in using systems that rely on large language models (LLMs) for practical applications. However, meaningful evaluation of these systems in real-world scenarios comes with a distinct set of challenges, which are not well addressed by the synthetic benchmarks and de facto metrics often seen in the literature. We present a practical evaluation framework that outlines how to proactively curate representative datasets, select meaningful evaluation metrics, and employ evaluation methodologies that integrate well with the practical development and deployment of LLM-reliant systems, which must adhere to real-world requirements and meet user-facing needs.
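The abstract names three ingredients: a curated dataset, a task-appropriate metric, and an evaluation loop that fits into development. A minimal sketch of that loop, assuming illustrative names (`EvalCase`, `exact_match`, `evaluate` are this sketch's own, not the paper's API, and the stub lambda stands in for an LLM-reliant system):

```python
# Minimal sketch of a curate-dataset / pick-metric / aggregate evaluation loop.
# All names here are illustrative assumptions, not the paper's framework.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class EvalCase:
    prompt: str
    reference: str

def exact_match(prediction: str, reference: str) -> float:
    """Toy metric: 1.0 on a case-insensitive exact match, else 0.0."""
    return float(prediction.strip().lower() == reference.strip().lower())

def evaluate(system: Callable[[str], str],
             cases: List[EvalCase],
             metric: Callable[[str, str], float]) -> float:
    """Run the system over every curated case and return the mean score."""
    scores = [metric(system(c.prompt), c.reference) for c in cases]
    return sum(scores) / len(scores)

# Usage with a stub "system" standing in for an LLM-reliant application.
cases = [EvalCase("capital of France?", "Paris"),
         EvalCase("2 + 2 = ?", "4")]
stub_system = lambda prompt: "Paris" if "France" in prompt else "5"
print(evaluate(stub_system, cases, exact_match))  # 0.5
```

Swapping in a real metric (semantic similarity, human preference) and a real system only changes the two callables; the curated dataset stays the stable artifact, which is the framework's point.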


DnDScore: Decontextualization and Decomposition for Factuality Verification in Long-Form Text Generation

Wanner, Miriam, Van Durme, Benjamin, Dredze, Mark

arXiv.org Artificial Intelligence

The decompose-then-verify strategy for verifying Large Language Model (LLM) generations decomposes claims that are then independently verified. Decontextualization augments claims so they can be verified outside their original context, enabling reliable verification. While decomposition and decontextualization have been explored independently, their interactions in a complete system have not been investigated. Their conflicting purposes can create tensions: decomposition isolates atomic facts, while decontextualization inserts relevant information. Furthermore, a decontextualized subclaim presents a challenge to the verification step: what part of the augmented text should be verified, now that it contains multiple atomic facts? We conduct an evaluation of different decomposition, decontextualization, and verification strategies and find that the choice of strategy matters for the resulting factuality scores. Additionally, we introduce DnDScore, a decontextualization-aware verification method that validates subclaims in the context of the information added during decontextualization.
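The pipeline the abstract describes — decompose a claim into subclaims, decontextualize each one, then verify — can be sketched as below. The `decompose`, `decontextualize`, and `verify` functions here are trivial stand-ins for the LLM calls a real system would make; none of this is DnDScore's actual implementation:

```python
# Hedged sketch of decompose-then-verify with decontextualization.
# Each step is a toy stand-in for an LLM call, not the paper's method.
from typing import List

def decompose(claim: str) -> List[str]:
    # Stand-in: split on " and " to mimic atomic-fact extraction.
    return [part.strip() for part in claim.split(" and ")]

def decontextualize(subclaim: str, context: str) -> str:
    # Stand-in: prepend the original context so the subclaim is
    # interpretable outside the source text.
    return f"[context: {context}] {subclaim}"

def verify(subclaim: str, source: str) -> bool:
    # Stand-in verifier: naive substring support check against the source.
    core = subclaim.split("] ", 1)[-1]
    return core.lower() in source.lower()

def factuality_score(claim: str, context: str, source: str) -> float:
    """Fraction of decontextualized subclaims supported by the source."""
    subclaims = [decontextualize(s, context) for s in decompose(claim)]
    supported = [verify(s, source) for s in subclaims]
    return sum(supported) / len(supported)
```

The tension the paper highlights is visible even here: once `decontextualize` has augmented the subclaim, `verify` must decide whether to check the whole string or just the core fact — this sketch arbitrarily checks only the core.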


Source2Synth: Synthetic Data Generation and Curation Grounded in Real Data Sources

Lupidi, Alisia, Gemmell, Carlos, Cancedda, Nicola, Dwivedi-Yu, Jane, Weston, Jason, Foerster, Jakob, Raileanu, Roberta, Lomeli, Maria

arXiv.org Artificial Intelligence

Large Language Models still struggle in challenging scenarios that involve structured data, complex reasoning, or tool usage. In this paper, we propose Source2Synth: a new method for teaching LLMs new skills without relying on costly human annotations. Source2Synth takes a custom data source as input and produces synthetic data points with intermediate reasoning steps grounded in real-world sources. Source2Synth improves dataset quality by discarding low-quality generations based on their answerability. We demonstrate the generality of this approach by applying it to two challenging domains: reasoning ability in multi-hop question answering (MHQA) and tool usage in tabular question answering (TQA). Our method improves performance by 25.51% for TQA on WikiSQL and 22.57% for MHQA on HotPotQA compared to fine-tuned baselines.
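The generate-then-filter idea in the abstract — synthesize examples from a real data source, then discard generations that fail an answerability check — can be sketched as follows. The table, the question template, and the filter are illustrative assumptions, not Source2Synth's actual prompts or code:

```python
# Sketch of source-grounded synthetic data generation with an
# answerability filter. All structures here are illustrative assumptions.
from typing import Dict, List, Tuple

def generate_examples(table: Dict[str, str]) -> List[Tuple[str, str]]:
    """Turn each (attribute, value) row of a source table into a QA pair."""
    return [(f"What is the value of {key}?", value)
            for key, value in table.items()]

def answerable(question: str, answer: str, table: Dict[str, str]) -> bool:
    """Keep only examples whose answer is actually recoverable from the source."""
    return answer != "" and answer in table.values()

def source2synth_style(table: Dict[str, str]) -> List[Tuple[str, str]]:
    candidates = generate_examples(table)
    # Simulate one noisy generation that should fail the answerability check.
    candidates.append(("What is the value of mass?", ""))
    return [(q, a) for q, a in candidates if answerable(q, a, table)]
```

The filtering step is where the abstract's quality claim lives: ungrounded or empty answers never reach the fine-tuning set.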


Compositional Generalization for Data-to-Text Generation

Xu, Xinnuo, Titov, Ivan, Lapata, Mirella

arXiv.org Artificial Intelligence

Data-to-text generation involves transforming structured data, often represented as predicate-argument tuples, into coherent textual descriptions. Despite recent advances, systems still struggle when confronted with unseen combinations of predicates, producing unfaithful descriptions (e.g., hallucinations or omissions). We refer to this issue as a failure of compositional generalization, and it motivated us to create a benchmark for assessing the performance of different approaches on this specific problem. Furthermore, we propose a novel model that addresses compositional generalization by clustering predicates into groups. Our model generates text sentence by sentence, relying on one cluster of predicates at a time. This approach significantly outperforms T5 baselines across all evaluation metrics. Notably, it achieves a 31% improvement over T5 on a metric focused on maintaining faithfulness to the input.
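The sentence-per-cluster strategy the abstract outlines can be sketched in miniature: group the input tuples into clusters, then realize one sentence per cluster. The grouping rule (shared subject) and the template realizer below are this sketch's own assumptions, not the paper's learned model:

```python
# Sketch of cluster-then-realize data-to-text generation.
# Clustering by shared subject and template realization are illustrative
# assumptions standing in for the paper's learned components.
from collections import defaultdict
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

def cluster_by_subject(triples: List[Triple]) -> Dict[str, List[Triple]]:
    clusters: Dict[str, List[Triple]] = defaultdict(list)
    for subj, pred, obj in triples:
        clusters[subj].append((subj, pred, obj))
    return dict(clusters)

def realize(cluster: List[Triple]) -> str:
    """Template realizer: one sentence covering exactly one cluster."""
    subj = cluster[0][0]
    facts = " and ".join(f"{pred} {obj}" for _, pred, obj in cluster)
    return f"{subj} {facts}."

def generate(triples: List[Triple]) -> str:
    # One sentence per cluster: each sentence conditions on only that
    # cluster's predicates, which is what limits hallucination/omission.
    return " ".join(realize(c) for c in cluster_by_subject(triples).values())
```

Because each sentence sees only its own cluster, an unseen combination of predicates decomposes into smaller, previously seen groups — the intuition behind the faithfulness gains the abstract reports.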


LIMA: Less Is More for Alignment

Zhou, Chunting, Liu, Pengfei, Xu, Puxin, Iyer, Srini, Sun, Jiao, Mao, Yuning, Ma, Xuezhe, Efrat, Avia, Yu, Ping, Yu, Lili, Zhang, Susan, Ghosh, Gargi, Lewis, Mike, Zettlemoyer, Luke, Levy, Omer

arXiv.org Artificial Intelligence

Large language models are trained in two stages: (1) unsupervised pretraining from raw text, to learn general-purpose representations, and (2) large-scale instruction tuning and reinforcement learning, to better align with end tasks and user preferences. We measure the relative importance of these two stages by training LIMA, a 65B-parameter LLaMa language model fine-tuned with the standard supervised loss on only 1,000 carefully curated prompts and responses, without any reinforcement learning or human preference modeling. LIMA demonstrates remarkably strong performance, learning to follow specific response formats from only a handful of examples in the training data, including complex queries that range from planning trip itineraries to speculating about alternate history. Moreover, the model tends to generalize well to unseen tasks that did not appear in the training data. In a controlled human study, responses from LIMA are either equivalent or strictly preferred to GPT-4 in 43% of cases; this statistic rises to 58% against Bard and 65% against DaVinci003, which was trained with human feedback. Taken together, these results strongly suggest that almost all knowledge in large language models is learned during pretraining, and only limited instruction tuning data is necessary to teach models to produce high-quality output.


Meet the American who wrote the moon-landing software: Margaret Hamilton, computer whiz and mom

FOX News

Computer prodigy Hamilton was just 32 years old when Apollo 11 put men on the moon, guided by her innovative software that saved the mission from being aborted minutes before landing on the lunar surface. The Apollo 11 moon landing was one giant leap for womankind. Credit Margaret Hamilton, a 32-year-old mother and computer whiz at the Massachusetts Institute of Technology, who wrote the software that placed Neil Armstrong and Buzz Aldrin on the moon on July 20, 1969. She also worked on the five moon-landing missions that followed. The director of software engineering at MIT's Instrumentation Laboratory, Hamilton was a pioneer of computer science in a transformative era, and on a transformative mission, in human history.


Cybereum Newsletter Vol-4

#artificialintelligence

The energy consumption from crypto mining has been increasing exponentially with the growing adoption of crypto. This is increasingly becoming a concern, as it should be. Large parts of the world suffer from energy deprivation due to unaffordability and inadequate energy generation. At the same time, climate goals will require the world to reduce net emissions, much of which are produced by electricity generation. Supporting the world's growth while reducing emissions, when large populations already suffer from energy deficiency, is a very difficult problem that will require trillions in capital over the coming two decades.


MIT deepfake shows Nixon sadly saying the Moon astronauts died

#artificialintelligence

Because the mission succeeded, Nixon never delivered the speech, but MIT engineers used deepfake technology to create a news broadcast in which a digitally-reconstructed Nixon delivers the bad news, WBUR News reports. The deepfake, which will be presented at a film festival Friday, illustrates just how easy it is to make virtual puppets deliver convincing speeches, even if they're totally removed from history. Francesca Panetta, co-director of the larger film in which the deepfake appears, told WBUR that she had someone actually read the script while impersonating Nixon's intonation and then used software to make the recording sound even more like Nixon's voice. It's not the most advanced way to create deepfakes out there, but it still gets the job done. "I had one person say, 'Oh, so you got an impersonator to impersonate Nixon,'" she told WBUR.